PPNet: Identifying Functional Association Networks by Phylogenetic Profiling of Prokaryotic Genomes

ABSTRACT Identification of microbial functional association networks allows interpretation of biological phenomena and a greater understanding of the molecular basis of pathogenicity and also underpins the formulation of control measures. Here, we describe PPNet, a tool that uses genome information and analysis of phylogenetic profiles with binary similarity and distance measures to derive large-scale bacterial gene association networks of a single species. As an exemplar, we have derived a functional association network in the pig pathogen Streptococcus suis using 81 binary similarity and dissimilarity measures which demonstrates excellent performance based on the area under the receiver operating characteristic (AUROC), the area under the precision-recall (AUPR), and a derived overall scoring method. Selected network associations were validated experimentally by using bacterial two-hybrid experiments. We conclude that PPNet, a publicly available (https://github.com/liyangjie/PPNet), can be used to construct microbial association networks from easily acquired genome-scale data. IMPORTANCE This study developed PPNet, the first tool that can be used to infer large-scale bacterial functional association networks of a single species. PPNet includes a method for assigning the uniqueness of a bacterial strain using the average nucleotide identity and the average nucleotide coverage. PPNet collected 81 binary similarity and distance measures for phylogenetic profiling and then evaluated and divided them into four groups. PPNet can effectively capture gene networks that are functionally related to phenotype from publicly prokaryotic genomes, as well as provide valuable results for downstream analysis and experiment testing.

1. As I understand the tool relies heavily on Roary to find homology between genes and build phylogenetic relationships. I find it a bit risky that the core of a new tool depends on a pipeline that hasn't been supported for 4 years (soon 5 years). In my opinion, this compromises the longevity of PPNet.
2. It is unclear how PPNet handles paralogs. What are the criteria for having only orthologous genes and excluding paralogs?
3. There is a tool, kover (https://aldro61.github.io/kover/index.html), which uses a machine learning approach and also allows genomic sequences to be associated with phenotypes. Authors should discuss the advantages of their tools over published ones.
4. PPNet requires that strain groups be created by the user (virulent vs non-virulent, for example). Sometimes it can be difficult to categorize certain strains. How resilient is PPNet? In this sense, if a strain is misclassified, what would be the impact on the results? 5. I believe Figure 2a is not cited in the text.
6. The authors have chosen to place non-typeable strains of S. suis in the non-virulent group. Why ? Is it so impossible for virulent strains to be untypable?
Reviewer #2 (Comments for the Author): Summary Li et al. have developed PPNet, a useful bioinformatic tool, implemented in python, to classify pairs of functionally related bacterial proteins, within a bacterial species, using what is referred to as "phylogenetic profiles" (a binary matrix of gene presence/absence) of all non-core proteins in the pangenome. They selected high quality publicly available S. suis genomes and assigned them into two groups of virulent and non-virulent strains based on in silico serotyping. These were then filtered for core genes and parts of the accessory genome were shown to correlate with virulent and non-virulent groups. Parts of the accessory genome were then constructed into functionally associated networks of orthologs, and some were validated using StringDB and experimentally using bacterial two-hybrid systems. PPNet can be a useful functional tool for pangenomics and assignment of functional networks of non-core proteins. It is particularly useful for understudied species and accessory genomes are typically less studied than the core genome. It could be used in conjunction with GWAS either upstream or downstream. PPNet as a "phenotype-to-gene presence/absence" association tool is not a novel concept and more robust tools exist for the same in the microbial GWAS field. However, PPNet as a fast and efficient functional association network tool for relevant accessory genes, is a novel and valuable resource for researchers like myself and others as an important step in analysis and validation of virulent genes in diverse bacterial omics datasets. Major Comments 1) Lines 139-144: It is unlikely that serotypes and virulence correlate as well as the paper suggests to classify the genomes into virulent and non-virulent. The authors attempted to work around this problem with DAPC but it needs to be further explained how the accessory genome was filtered into 1060 VRDGs either in the results or methods. Alternative to the DAPC method, authors could validate their phenotype associated orthologs with a more robust method other than selection of 100 validated principal components for DAPC. Authors have used the tool Roary for ortholog determination, but gene presence/absence microbial GWAS can be conducted using Scoary (or others), another tool by the same group for statistically significant phenotype associated separation of ortholog data. Since this is an already established concept in the phenotype association field, these tools should be used or compared in relation to PPNet, and any advantages or disadvantages of both methods should be highlighted.
2) The paper doesn't discuss or address the possibility of an ortholog of accessory genes having minor differences in sequence which associate with one phenotype over another or affect functional association in one group over the other. It is worth discussing this, as other tools operate at the SNP and unitig level to identify such variants. PPNet would likely not detect these differentially present gene networks as they would not separate out at the ortholog level.
3) While PPNet is simple to download and install as the paper suggests. Once installed it seems to be missing a key python script (average_nucleotide_identity.py) rendering it non-functional. Authors should fix the github version and attempt installation and execution of PPNet on an independent system. 4) The data presented in Figure 3 for choice of similarity/distance measure (Eqn. 60) doesn't' match the data referenced in the Supplementary Table 2. In the table, AUROC and AUPR values exceed 100% and this has not been explained nor do the values always correspond to the ones in the figure. The chosen equation (C1 Eqn60) for the network construction is the only one with an AUROC >100% listed in the table. 5) Input datasets of binary accessory genomes (used in DAPC and network construction) could be provided as a supplemental table for review. This could also be of benefit to the people who study S. suis as a database of S suis accessory orthologs. Figure Legend 1 -ANI and ANC thresholds are described as user-adjustable but the current version of PPNet appears to be hard coded. -The figure clearly depicts what is meant by the term "phylogenetic profile". However, this should be better explained in the text as just a binary matrix as "phylogenetic profile" is a bit vague and could be referring to other things like phylogenetic tree topology. Figure Legend 2 -Panel C: It needs to be mentioned that the tree is constructed using the accessory ortholog profile in the legend as it is mentioned in the text. This could be misinterpreted as a whole genome tree.

Minor Comments
-Panel D has -log10(padj.) mentioned but it's not clear what these values reflect. Figure Legend 3 -Panel a cites Supplemental Table 5 when it should be Supplemental Table 2 -Panel b highlights the relevance of Eqn. 60 for S suis but the colored boxes are not explained. Figure Legend 5 -Cite the Supplemental table listing each "Mx" number and associated genes in the legend. Line 38 -typo: experimental -> experiments Lines 44-47 -same as the abstract Line 128 -typo: publically ->publicly Line 156-157 -significantly distributed is a misnomer as no direct gene presence-absence statistics were present. Perhaps "overrepresented in virulent genomes" Staff Comments:

Preparing Revision Guidelines
To submit your modified manuscript, log onto the eJP submission site at https://spectrum.msubmit.net/cgi-bin/main.plex. Go to Author Tasks and click the appropriate manuscript title to begin the revision process. The information that you entered when you first submitted the paper will be displayed. Please update the information as necessary. Here are a few examples of required updates that authors must address: • Point-by-point responses to the issues raised by the reviewers in a file named "Response to Reviewers," NOT IN YOUR COVER LETTER.
• Upload a compare copy of the manuscript (without figures) as a "Marked-Up Manuscript" file. • Each figure must be uploaded as a separate file, and any multipanel figures must be assembled into one file. For complete guidelines on revision requirements, please see the journal Submission and Review Process requirements at https://journals.asm.org/journal/Spectrum/submission-review-process. Submissions of a paper that does not conform to Microbiology Spectrum guidelines will delay acceptance of your manuscript. " Please return the manuscript within 60 days; if you cannot complete the modification within this time period, please contact me. If you do not wish to modify the manuscript and prefer to submit it to another journal, please notify me of your decision immediately so that the manuscript may be formally withdrawn from consideration by Microbiology Spectrum.
If your manuscript is accepted for publication, you will be contacted separately about payment when the proofs are issued; please follow the instructions in that e-mail. Arrangements for payment must be made before your article is published. For a complete list of Publication Fees, including supplemental material costs, please visit our website.
Corresponding authors may join or renew ASM membership to obtain discounts on publication fees. Need to upgrade your membership level? Please contact Customer Service at Service@asmusa.org.
Thank you for submitting your paper to Microbiology Spectrum.

Summary
Li et al. have developed PPNet, a useful bioinformatic tool, implemented in python, to classify pairs of functionally related bacterial proteins, within a bacterial species, using what is referred to as "phylogenetic profiles" (a binary matrix of gene presence/absence) of all non-core proteins in the pangenome. They selected high quality publicly available S. suis genomes and assigned them into two groups of virulent and non-virulent strains based on in silico serotyping. These were then filtered for core genes and parts of the accessory genome were shown to correlate with virulent and non-virulent groups. Parts of the accessory genome were then constructed into functionally associated networks of orthologs, and some were validated using StringDB and experimentally using bacterial two-hybrid systems.
PPNet can be a useful functional tool for pangenomics and assignment of functional networks of non-core proteins. It is particularly useful for understudied species and accessory genomes are typically less studied than the core genome. It could be used in conjunction with GWAS either upstream or downstream. PPNet as a "phenotype-to-gene presence/absence" association tool is not a novel concept and more robust tools exist for the same in the microbial GWAS field. However, PPNet as a fast and efficient functional association network tool for relevant accessory genes, is a novel and valuable resource for researchers like myself and others as an important step in analysis and validation of virulent genes in diverse bacterial omics datasets.

Major Comments
1) Lines 139-144: It is unlikely that serotypes and virulence correlate as well as the paper suggests to classify the genomes into virulent and non-virulent. The authors attempted to work around this problem with DAPC but it needs to be further explained how the accessory genome was filtered into 1060 VRDGs either in the results or methods.
Alternative to the DAPC method, authors could validate their phenotype associated orthologs with a more robust method other than selection of 100 validated principal components for DAPC. Authors have used the tool Roary for ortholog determination, but gene presence/absence microbial GWAS can be conducted using Scoary (or others), another tool by the same group for statistically significant phenotype associated separation of ortholog data. Since this is an already established concept in the phenotype association field, these tools should be used or compared in relation to PPNet, and any advantages or disadvantages of both methods should be highlighted.
2) The paper doesn't discuss or address the possibility of an ortholog of accessory genes having minor differences in sequence which associate with one phenotype over another or affect functional association in one group over the other. It is worth discussing this, as other tools operate at the SNP and unitig level to identify such variants. PPNet would likely not detect these differentially present gene networks as they would not separate out at the ortholog level.
3) While PPNet is simple to download and install as the paper suggests. Once installed it seems to be missing a key python script (average_nucleotide_identity.py) rendering it non-functional. Authors should fix the github version and attempt installation and execution of PPNet on an independent system. Figure 3 for choice of similarity/distance measure (Eqn. 60) doesn't' match the data referenced in the Supplementary Table 2. In the table, AUROC and AUPR values exceed 100% and this has not been explained nor do the values always correspond to the ones in the figure. The chosen equation (C1 Eqn60) for the network construction is the only one with an AUROC >100% listed in the table.

4) The data presented in
5) Input datasets of binary accessory genomes (used in DAPC and network construction) could be provided as a supplemental table for review. This could also be of benefit to the people who study S. suis as a database of S suis accessory orthologs.

Minor Comments
Figure Legend 1 -ANI and ANC thresholds are described as user-adjustable but the current version of PPNet appears to be hard coded.
-The figure clearly depicts what is meant by the term "phylogenetic profile". However, this should be better explained in the text as just a binary matrix as "phylogenetic profile" is a bit vague and could be referring to other things like phylogenetic tree topology.   Table 5 when it should be Supplemental Table 2 -Panel b highlights the relevance of Eqn. 60 for S suis but the colored boxes are not explained. The manuscript describes a bioinformatics tool, PPNet, which makes it possible to make associations between certain phenotypes and genes in bacteria. Although the idea is interesting, I have a few comments:

As I understand the tool relies heavily on Roary to find homology between genes and build phylogenetic relationships. I find it a bit risky that the core of a new tool depends on a pipeline that hasn't been supported for 4 years (soon 5 years). In my opinion, this compromises the longevity of PPNet.
Response: We appreciate your professional suggestion. It's true that Roary hasn't had a major version update in a long time, but Roary is still one of the most used canonical pan genome pipelines. We found the software has been used and cited more than 630 times in 2022 (data sourced from Google Scholar), including in not only Microbial Spectrum but other highly-rated journals, for example:

It is unclear how PPNet handles paralogs. What are the criteria for having only orthologous genes and excluding paralogs?
Response: We are sorry that we didn't clarify this important issue in the manuscript and we thank you very much for your constructive comments.
Paralogs are identified as genes with near identical sequence may perform a different function or under differential regulation. For this reason, we adopted the default setting of Roary, which splits paralogs from homologous groups into groups of true orthologs by using conserved gene neighborhood information. We have added this information to the "MATERIALS AND METHODS" section in the revised manuscript (lines 330-334): "PPNet uses Prokka (63) to automate the annotation all genomes, and then extracts the GFF3 format annotation files from the output files as the input files for Roary (31) to construct the phylogenetic profile, we adopted the default setting of Roary, which splits paralogs from homologous groups into groups of true orthologs by using conserved gene neighborhood information." 3. There is a tool, kover (https://aldro61.github.io/kover/index.html), which uses a machine learning approach and also allows genomic sequences to be associated with phenotypes. Authors should discuss the advantages of their tools over published ones.
Response: We thank the reviewer for the constructive comments. Kover is an excellent software, and we have added some discussion to the revised manuscript (lines 243-249): "Another programme for determining genes associated with phenotypes is Kover, a k-mer-based software using machine learning algorithms which allows users to find some k-mers (sequences of k length) that are associated with phenotype (41). Kover recognises k-mer presence/absence rather than gene presence/absence, and is convenient to test other types of representations for genomic variants, such as single nucleotide polymorphisms (SNPs) and unitig level. However, Kover is less user friendly as users need to further annotate through sequence alignment to identify the cognate gene. Response: Thank you very much for your constructive comments. The correct grouping of strains is important, therefore, PPNet requires the users to have certain understanding of their strains. If some strains are difficult to group due to lack of certain theoretic support, we suggest that these strains can be filtered out to reduce the impact on the results. In our study, we grouped strains according to the epidemiological surveys. However, epidemiology can only ensure that most strains can be grouped correctly, and there will inevitably be cases in the literature that are misclassified. Nevertheless, PPNet finally identified 1060 VRDGs based on Fisher's exact test, many of which have been confirmed by previous studies. In addition, the functional association network of S. suis predicted by PPNet was shown to be reliable by AUROC, AUPR, overall scoring and bacterial two-hybrid experimental. Therefore, PPNet is a resilient software.

I believe Figure 2a is not cited in the text.
Response: We sincerely apologize for our mistake. Figure 2a  Lines 102 -105: "In addition, PPNet divides strains into two groups based on phenotypic information provided by the user, and compares the distribution of each ortholog from different phenotypic groups, and only the phylogenetic profile of orthologs with significantly different distributions is selected for network inference (Fig. 2a)." Lines 138-140: "To obtain more valuable phylogenetic profiles, PPNet identifies virulence-related differential genes (VRDGs) by comparing the distribution of genes from virulent and non-virulent genomes (Fig. 2a)."

The authors have chosen to place non-typeable strains of S. suis in the nonvirulent group. Why? Is it so impossible for virulent strains to be untypable?
Response: Thanks for your valuable comments. We placed non-typeable strains Response: We are so sorry that the description of this part is not clear enough.
Discriminant Analysis of Principle Components (DAPC) was used in our study to determine whether the genotypes of the isolates (accessory genome) were distinct between virulent and non-virulent strains. The 100 validated principal components were determined by cross-validation, which accounted for approximately 82.73% of the total genetic variability (We sincerely apologize for our mistake that the original manuscript is wrongly written as "18%", "82.73%" were calculated from the following figure, and "18%" has been corrected to "82.73%" in the line 384 of revised manuscript). Then, we filtered 1060 VRDGs with Fisher's exact test, which were significantly distributed in virulent genomes.
DAPC is a core technique used in many population genetic studies (Miller et al., 2020). Lines 138-147: "To obtain more valuable phylogenetic profiles, PPNet identifies virulence-related differential genes (VRDGs) by comparing the distribution of genes from virulent and non-virulent genomes (Fig. 2a). Each gene receives its own null hypothesis of no association to virulence, and a Fisher's test is performed (see Method). VRDGs are defined as gene families that are overrepresented in virulent genomes. A total of 1060 VRDGs were identified, and phylogenetic profiles of VRDGs were used to infer an association network (Table S2). Figure 2d shows that VRDGs were predominantly present in virulent genomes when compared with non-virulent genomes. Finally, PPNet generated a total of 81 virulence-related gene association networks based on 81 binary similarity and dissimilarity measures (32, 33) (Table S3)" Lines 335-342: "In order to obtain a more valuable phylogenetic profile for network prediction, PPNet requires strain grouping information and creates a 2 × 2 contingency table, the levels being presence or absence for the trait and gene, respectively, with counts of the number of isolates in each cell. For each gene, we assume the null hypothesis i.e. it is independent of virulence, and uses Fisher's exact test to compute p-values. Finally, p-values were corrected by the False Discovery Rate. Genes with adjusted p value (p.adj) < 0.05 are considered phenotype-related differential genes, and the phylogenetic profile of these differential genes was used for network inference."

Authors have used the tool Roary for ortholog determination, but gene
presence/absence microbial GWAS can be conducted using Scoary (or others), another tool by the same group for statistically significant phenotype associated separation of ortholog data. Since this is an already established concept in the phenotype association field, these tools should be used or compared in relation to PPNet, and any advantages or disadvantages of both methods should be highlighted.
Response: According to the reviewer's suggestion, we have compared Scoary and our approach in the discussion section (lines 243-255): Lines 243-255: "Another programme for determining genes associated with phenotypes is Kover, a k-mer-based software using machine learning algorithms which allows users to find some k-mers (sequences of k length) that are associated with phenotype (41). Kover recognises k-mer presence/absence rather than gene presence/absence, and is convenient to test other types of representations for genomic variants, such as single nucleotide polymorphisms (SNPs) and unitig level. However, Kover is less user friendly as users need to further annotate through sequence alignment to identify the cognate gene. Response: We sincerely apologize for our mistake. Thanks for your valuable comments. PPNet requires users to install pyani module first, which includes the "average_nucleotide_identity.py" script. We have added this description in the "README.md" file of the github and installed and executed PPNet on an independent system. Figure 3 for choice of similarity/distance measure (Eqn. 60) doesn't' match the data referenced in the Supplementary    Table 5 when it should be Supplemental Table 2 Response: We sincerely apologize for our mistake. We have corrected this mistake (lines 722-723):

The data presented in
" Figure